We congratulate the authors for a well-written
and thoughtful survey of some of the literature in
this area. They are mainly concerned with the geometry
and the computational learning aspects of
the support vector machine (SVM). We will therefore
complement their review by discussing the SVM from the
perspective of statistical function estimation. In particular,
we will elaborate on the following points:

• Kernel regularization is essentially a generalized 
ridge penalty in a certain feature space. 

• In practice, the effective dimension of the data
kernel matrix is not always equal to n, even when
the implicit dimension of the feature space is infinite;
hence, the training data are not always perfectly
separable.

• Appropriate regularization plays an important role 
in the success of the SVM. 

• The SVM is not fundamentally different from many
statistical tools that statisticians are familiar
with, for example, penalized logistic regression.

We acknowledge that many of the comments are
based on our earlier paper (Hastie, Rosset, Tibshirani
and Zhu, 2004).

KERNEL REGULARIZATION AND THE 
GENERALIZED RIDGE PENALTY 

Given a positive definite kernel $K(x,x')$, where
$x, x'$ belong to a certain domain $\mathcal{X}$, we consider the
general function estimation problem

(1)  $\min_{\beta_0,\, f \in \mathcal{H}_K} \sum_{i=1}^{n} \ell\bigl(y_i, \beta_0 + f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2.$

Here $\ell(\cdot,\cdot)$ is a convex loss function that describes
the "closeness" between the observed data and the
fitted model, and $f$ is an element in the span of
$\{K(\cdot,x'),\ x' \in \mathcal{X}\}$. More precisely, $f \in \mathcal{H}_K$ is a function
in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$
generated by $K(\cdot,\cdot)$ (see Burges, 1998; Evgeniou,
Pontil and Poggio, 2000; and Wahba, 1999,
for details).

Suppose the positive definite kernel $K(\cdot,\cdot)$ has a
(possibly finite) eigenexpansion,

$K(x,x') = \sum_{j=1}^{\infty} \delta_j \phi_j(x) \phi_j(x'),$

where $\delta_1 \ge \delta_2 \ge \cdots \ge 0$ are the eigenvalues and the $\phi_j(x)$'s
are the corresponding eigenfunctions. Elements of
$\mathcal{H}_K$ have an expansion in terms of these eigenfunctions

(2)  $f(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x),$

with the constraint that

$\|f\|_{\mathcal{H}_K}^2 = \sum_{j=1}^{\infty} \beta_j^2 / \delta_j < \infty,$

where $\|\cdot\|_{\mathcal{H}_K}$ is the norm induced by $K(\cdot,\cdot)$.

Then we can rewrite (1) as

(3)  $\min_{\beta_0,\, \beta} \sum_{i=1}^{n} \ell\Bigl(y_i, \beta_0 + \sum_{j=1}^{\infty} \beta_j \phi_j(x_i)\Bigr) + \lambda \sum_{j=1}^{\infty} \frac{\beta_j^2}{\delta_j},$

and we can see that the regularization term $\|f\|_{\mathcal{H}_K}^2$
in (1) can be interpreted as a generalized ridge penalty,
where eigenfunctions with small eigenvalues in the
expansion (2) get penalized more and vice versa.

Formulation (3) seems to be an infinite-dimensional
optimization problem, but according to the
representer theorem (Kimeldorf and Wahba, 1971;
Wahba, 1990), the solution is finite-dimensional and
has the form

$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$

Using the reproducing property of $\mathcal{H}_K$, that is,
$\langle K(\cdot,x_i), K(\cdot,x_{i'}) \rangle = K(x_i,x_{i'})$, (3) also reduces to
a finite-dimensional criterion,

(4)  $\min_{\beta_0,\, \boldsymbol{\alpha}} L(\mathbf{y}, \beta_0 + \mathbf{K}\boldsymbol{\alpha}) + \lambda \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}.$

Here we use vector notation, $\mathbf{K}$ is the $n \times n$ data kernel
matrix with elements equal to $K(x_i,x_{i'})$, $i,i' = 1,\ldots,n$,
and $L(\mathbf{y}, \beta_0 + \mathbf{K}\boldsymbol{\alpha}) = \sum_{i=1}^{n} \ell(y_i, \beta_0 + f(x_i))$.
We reparametrize (4) using the eigendecomposition
$\mathbf{K} = \mathbf{U}\mathbf{D}\mathbf{U}^T$, where $\mathbf{D}$ is diagonal and $\mathbf{U}$ is orthogonal.
Let $\mathbf{K}\boldsymbol{\alpha} = \mathbf{U}\boldsymbol{\beta}$, where $\boldsymbol{\beta} = \mathbf{D}\mathbf{U}^T\boldsymbol{\alpha}$. Then
(4) becomes

(5)  $\min_{\beta_0,\, \boldsymbol{\beta}} L(\mathbf{y}, \beta_0 + \mathbf{U}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T \mathbf{D}^{-1} \boldsymbol{\beta}.$

Now the columns of $\mathbf{U}$ are unit-norm basis functions
that span the column space of $\mathbf{K}$; and again, we
see that those members that correspond to small
eigenvalues (the elements of the diagonal matrix $\mathbf{D}$)
get heavily penalized and vice versa.
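
The reparametrization from (4) to (5) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the six equally spaced inputs, the random coefficient vector and the helper name `radial_kernel` are illustrative assumptions, chosen so that the kernel matrix is well conditioned and $\mathbf{D}^{-1}$ is numerically safe.

```python
import numpy as np

def radial_kernel(x, gamma=1.0):
    """Radial kernel matrix with entries exp(-gamma * (x_i - x_i')**2), 1-D inputs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-gamma * diff**2)

# Assumption: six well-separated points, so all eigenvalues are safely positive.
x = np.linspace(-2.0, 2.0, 6)
K = radial_kernel(x, gamma=1.0)

# Eigendecomposition K = U D U^T (eigh returns eigenvalues in ascending order).
d, U = np.linalg.eigh(K)

rng = np.random.default_rng(0)
alpha = rng.normal(size=x.size)   # arbitrary coefficient vector for criterion (4)
beta = d * (U.T @ alpha)          # beta = D U^T alpha

fit_alpha = K @ alpha             # fitted values K alpha in (4)
fit_beta = U @ beta               # fitted values U beta in (5)
pen_alpha = alpha @ K @ alpha     # penalty alpha^T K alpha in (4)
pen_beta = np.sum(beta**2 / d)    # penalty beta^T D^{-1} beta in (5)

print(np.allclose(fit_alpha, fit_beta), np.isclose(pen_alpha, pen_beta))
```

The division by `d` in the last penalty makes the generalized ridge structure explicit: a coefficient on an eigenvector with a small eigenvalue costs proportionally more.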

In the machine learning community, people tend
to view the kernel as providing an implicit map of $x$
from $\mathcal{X}$ to a certain high-dimensional feature space,
and $K(\cdot,\cdot)$ computes inner products in this (possibly
infinite-dimensional) feature space. Specifically, the
features are

$h_j(x) = \sqrt{\delta_j}\,\phi_j(x)$  or  $h(x) = (h_1(x), h_2(x), \ldots)$,

and we have

$K(x,x') = \langle h(x), h(x') \rangle.$

Furthermore, let $\theta_j = \beta_j / \sqrt{\delta_j}$ and $\mathbf{H} = \mathbf{U}\mathbf{D}^{1/2}$. Then
(3) and (5) become

(6)  $\min_{\beta_0,\, \theta} \sum_{i=1}^{n} \ell\Bigl(y_i, \beta_0 + \sum_{j=1}^{\infty} \theta_j h_j(x_i)\Bigr) + \lambda \sum_{j=1}^{\infty} \theta_j^2$

and

(7)  $\min_{\beta_0,\, \boldsymbol{\theta}} L(\mathbf{y}, \beta_0 + \mathbf{H}\boldsymbol{\theta}) + \lambda \boldsymbol{\theta}^T \boldsymbol{\theta},$

respectively. This shows kernel regularization as an
exact ridge penalty in the feature space, but unlike
(3) and (5), it hides the fact that eigenfunctions
are differentially penalized according to their
corresponding eigenvalues.
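
The equivalence between the generalized ridge form (5) and the exact ridge form (7) can be sketched in a few lines. This is an illustrative check only: the inputs, the random coefficients and the variable names are assumptions, and the kernel matrix is again kept small and well conditioned.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 6)                      # assumed well-separated 1-D inputs
K = np.exp(-1.0 * (x[:, None] - x[None, :])**2)    # radial kernel with gamma = 1

d, U = np.linalg.eigh(K)       # K = U diag(d) U^T
H = U * np.sqrt(d)             # H = U D^{1/2}: column j of U scaled by sqrt(d_j)

rng = np.random.default_rng(1)
beta = rng.normal(size=x.size)   # coefficients in the eigenvector basis, as in (5)
theta = beta / np.sqrt(d)        # theta_j = beta_j / sqrt(delta_j)

gram_ok = np.allclose(H @ H.T, K)                         # inner products of features give K
fit_ok = np.allclose(H @ theta, U @ beta)                 # same fitted values in (5) and (7)
pen_ok = np.isclose(theta @ theta, np.sum(beta**2 / d))   # same penalty value
print(gram_ok, fit_ok, pen_ok)
```

The columns of `H` have norms `sqrt(d)`, so the plain ridge penalty on `theta` quietly carries the eigenvalue weighting that (5) shows explicitly.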

To illustrate the point, we consider a simple example.
The data $x_i$'s are one-dimensional and were generated
from the standard Gaussian distribution ($n = 50$).
The radial kernel function $K(x,x') = \exp(-\gamma \|x - x'\|^2)$
was used, with $\gamma = 1$. Figure 1 shows the eigenvalues
of the kernel matrix $\mathbf{K}$. The left panel of Figure 2
shows the first 16 eigenvectors of $\mathbf{K}$ (columns
of $\mathbf{U}$) and the right panel shows the corresponding
features (columns of $\mathbf{H}$). As we can see from the left
panel of Figure 2, eigenvectors with large eigenvalues
(which are hence penalized less) tend to be smooth,
while eigenvectors with small eigenvalues (which are hence
penalized more) tend to be wiggly; therefore, $\|f\|_{\mathcal{H}_K}^2$
is also often interpreted as a roughness measure of
the function $f$. We can also see from the right panel
of Figure 2 that many of the features are "norm
challenged," that is, they are squashed down dramatically
by their eigenvalues.

Fig. 1. Eigenvalues (on the log scale) of the data kernel matrix K.
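
A setup of this kind is easy to reproduce in outline. The sketch below is an assumption-laden stand-in rather than the code behind Figures 1 and 2: the random seed is arbitrary, and counting sign changes along the sorted inputs is used here as a crude, hypothetical proxy for how "wiggly" an eigenvector is.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=50))                   # n = 50 standard Gaussian inputs, sorted
K = np.exp(-1.0 * (x[:, None] - x[None, :])**2)    # radial kernel with gamma = 1

d, U = np.linalg.eigh(K)          # eigenvalues in ascending order
d, U = d[::-1], U[:, ::-1]        # reorder so the largest eigenvalue comes first

def sign_changes(v):
    """Number of sign changes of v along the sorted inputs (crude roughness proxy)."""
    s = np.sign(v)
    return int(np.sum(s[:-1] * s[1:] < 0))

changes = [sign_changes(U[:, j]) for j in range(8)]
for j in range(8):
    print(f"eigenvector {j + 1}: eigenvalue = {d[j]:.3e}, sign changes = {changes[j]}")
```

Scaling column `j` of `U` by `np.sqrt(d[j])` gives the corresponding feature, which makes it easy to see how quickly the later columns shrink.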

EFFECTIVE DIMENSION OF THE DATA 
KERNEL MATRIX 

As we have seen in the previous section, the kernel
$K(\cdot,\cdot)$ maps $x$ from its original input space to some
high-dimensional feature space $h(x)$. In the case of
classification, it is sometimes argued that the implicit
feature space can be infinite-dimensional (e.g.,
via the radial basis kernel), which suggests that perfect
separation of the training data is always possible.
However, this is not always true in practice.

To illustrate the point, we consider a two-class
classification example. The data were generated from
a pair of mixture Gaussian densities, described in detail
by Hastie, Tibshirani and Friedman (2001). The
radial kernel function $K(x,x') = \exp(-\gamma \|x - x'\|^2)$
was used. Four different values of $\gamma$ (0.1, 0.5, 1 and
5) were tried. For each of the values of $\gamma$, the SVM
was fitted for a sequence of values of $\lambda$, ranging from
the most regularized model to the least regularized
model.

Fig. 2. The left panel shows the eigenvectors of the data kernel matrix K and the right panel shows the corresponding features
(eigenvectors scaled by the square roots of the corresponding eigenvalues).



Table 1

Results for the mixture simulation example

γ                   5      1      0.5    0.1
Training errors     0      12     21     33
Effective rank      200    177    143    76



We were at first surprised to discover that not
all these sequences achieved zero training errors on
the 200 training data points, at their least regularized
fit. The minimal training errors and the corresponding
values for $\gamma$ are summarized in Table 1.
The second row of the table shows the effective
rank of the data kernel matrix $\mathbf{K}$ (which we defined
to be the number of eigenvalues greater than $10^{-12}$).
This $200 \times 200$ matrix has elements $K(x_i,x_{i'})$, $i,i' = 1,\ldots,200$.
In this example, a full rank $\mathbf{K}$ is required
to achieve perfect separation. Similar observations
have also appeared in Williams and Seeger (2000)
and Bach and Jordan (2002).

This emphasizes again the fact that not all features
in the feature map implied by $K(\cdot,\cdot)$ are of
equal stature (see the right panel in Figure 2); many
of them are shrunken way down to zero. The regularization
in (3) and (5) penalizes unit-norm eigenvectors
by the inverse of their eigenvalues, which effectively
annihilates some, depending on $\gamma$. Small $\gamma$ implies
wide, flat kernels and a suppression of wiggly,
rough functions. Figure 3 shows the eigenvalues of $\mathbf{K}$
for the four values of $\gamma$. The larger eigenvalues correspond
in this case to smoother eigenfunctions, the
smaller ones to rougher. The rougher eigenfunctions
get penalized exponentially more than the smoother
ones. Hence, for smaller values of $\gamma$, the effective dimension
of the feature space is truncated.
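
The effective-rank calculation can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Table 1: the mixture data are not available here, so 200 standard bivariate Gaussian points are used as a stand-in, and the function name `effective_rank` is ours. Only the definition (number of eigenvalues greater than $10^{-12}$) and the four values of $\gamma$ come from the text, so the counts will differ from those in the table.

```python
import numpy as np

def effective_rank(X, gamma, tol=1e-12):
    """Number of eigenvalues of the radial kernel matrix that exceed tol."""
    sq = np.sum(X**2, axis=1)
    # Squared pairwise distances, clipped at zero to absorb rounding error.
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * dist2)
    eigenvalues = np.linalg.eigvalsh(K)
    return int(np.sum(eigenvalues > tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # assumed stand-in for the 200 mixture training points

ranks = {gamma: effective_rank(X, gamma) for gamma in (5.0, 1.0, 0.5, 0.1)}
for gamma, rank in ranks.items():
    print(f"gamma = {gamma}: effective rank = {rank}")
```

On data of this kind one would expect the count to fall as $\gamma$ decreases, which is the truncation of the effective dimension described above.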



THE NEED FOR CAREFUL REGULARIZATION 

The SVM has been very successful for the classification
problem and has gained a lot of attention in the
machine learning community in the past ten years.
Many papers have been published to explain why it
performs so well. Most of this literature concentrates
on the concept of margin. Various misclassification
error bounds have been derived based on the margin
(Vapnik, 1995; Bartlett and Shawe-Taylor, 1999;
Shawe-Taylor and Cristianini, 1999).

However, our view is a little different from that
based on the concept of margin. Several researchers
have noted the relationship between the SVM and
regularized function estimation in RKHS (Evgeniou,
Pontil and Poggio, 2000; Wahba, 1999). The regularized
function estimation problem contains two
parts: a loss function and a penalty term [e.g., (1)].
The SVM uses the so-called hinge loss function (see
Figure 5). The margin maximizing property of the
SVM derives from the hinge loss function. Hence
margin maximization is by nature a nonregularized
objective, and solving it in high-dimensional space
is likely to lead to overfitting and bad prediction
performance. This has been observed in practice by
many researchers, in particular Breiman (1999) and
Marron and Todd (2002).

The loss + penalty formulation emphasizes the
role of regularization. In many situations we have
sufficient features (e.g., gene expression arrays) to
guarantee separation. We may nevertheless avoid
the maximum margin separator ($\lambda \downarrow 0$) in favor of
a more regularized solution.

Figure 4 shows the test error as a function of $\lambda$ for
the mixture data example. Here we see a dramatic
range in the correct choice of $\lambda$. When $\gamma = 5$, the
most regularized model is called for. On the other
hand, when $\gamma = 0.1$, we would want to choose among
the least regularized models. Depending on the value
of $\gamma$, the optimal $\lambda$ can occur at either end of the
spectrum or anywhere in between, emphasizing the
need for careful selection.
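
The sensitivity of the test error to $\lambda$ can be explored with a simple grid search. The sketch below is hedged in several ways: it swaps the hinge loss for the squared error loss, so that criterion (4) has a closed-form solution and no SVM solver is needed; it omits the intercept $\beta_0$; and it uses hypothetical two-class Gaussian data rather than the mixture example. The helper names and the grid of $\lambda$ values are likewise assumptions, so the resulting curves are not those of Figure 4.

```python
import numpy as np

def rbf(A, B, gamma):
    """Radial kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def make_data(n, rng):
    """Hypothetical two-class data: labels +/-1, class means at (+1, 0) and (-1, 0)."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, 2)) + np.column_stack([y, np.zeros(n)])
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(2000, rng)

lambdas = 10.0 ** np.arange(-4, 3)   # assumed grid, from light to heavy regularization
test_error = {}
for gamma in (5.0, 1.0, 0.5, 0.1):
    K = rbf(X_train, X_train, gamma)
    K_test = rbf(X_test, X_train, gamma)
    errors = []
    for lam in lambdas:
        # Squared error version of (4), no intercept: alpha = (K + lam I)^{-1} y.
        alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
        errors.append(np.mean(np.sign(K_test @ alpha) != y_test))
    test_error[gamma] = np.array(errors)
    best = int(np.argmin(errors))
    print(f"gamma = {gamma}: best lambda = {lambdas[best]:g}, test error = {errors[best]:.3f}")
```

Comparing the position of the best $\lambda$ across the four values of $\gamma$ is the quickest way to see whether a single default would have been adequate on a given data set.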

CONNECTION WITH OTHER STATISTICAL 
TOOLS 

Last, we would like to comment on the connection 
between the SVM and some statistical tools that 
statisticians are familiar with. 

As we have seen in previous sections, what is special
about the SVM is not the regularization term,
but rather the loss function, that is, the hinge
loss. Lin (2002) pointed out that the hinge loss is
Bayes consistent, that is, the population minimizer
of the loss function agrees with the Bayes rule in
terms of classification. This is important in explaining
the success of the SVM, because it implies that
the SVM is trying to implement the Bayes rule.

Fig. 3. The eigenvalues (on the log scale) for the data kernel matrices K that correspond to the four values of γ.

Fig. 4. Test error curves for the mixture example, using four different values for the radial kernel parameter γ. Large values
of λ correspond to heavy regularization, small values of λ to light regularization. Depending on the value of γ, the optimal λ
can occur at either end of the spectrum or anywhere in between, emphasizing the need for careful selection.

On the other hand, notice that the hinge loss and
the binomial deviance have very similar shapes (see
Figure 5): both increase linearly as $yf$ gets very
small (negative), and both encourage $y$ and $f$ to
have the same sign. Hence it is reasonable to conjecture
that by replacing the hinge loss of the SVM
with the binomial deviance, which is also Bayes consistent,
we should be able to get a fitted model that
performs similarly to the SVM. In fact, in Zhu and
Hastie (2005), we show that under certain conditions,
the classification boundary of the resulting
penalized logistic regression (using the binomial deviance)
and that of the SVM coincide. Penalized logistic
regression has been studied by many statisticians
(see Green and Silverman, 1994; Wahba, Gu,
Wang and Chappell, 1995; and Lin et al., 2000, for
details). We understand why it can work well. The
same reasoning could be applied to the SVM.
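
The similarity in shape between the two losses is easy to tabulate. In this sketch the hinge loss is taken as $\max(0, 1 - yf)$ and the binomial deviance as $\log(1 + e^{-yf})$; this scaling of the deviance, the function names and the grid of margin values are assumptions made for illustration.

```python
import numpy as np

def hinge(margin):
    """Hinge loss max(0, 1 - yf) as a function of the margin yf."""
    return np.maximum(0.0, 1.0 - margin)

def binomial_deviance(margin):
    """Binomial deviance log(1 + exp(-yf)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -margin)

margins = np.array([-20.0, -5.0, -1.0, 0.0, 1.0, 5.0])
for m, h, b in zip(margins, hinge(margins), binomial_deviance(margins)):
    print(f"yf = {m:6.1f}   hinge = {h:8.4f}   deviance = {b:8.4f}")

# Increase in each loss when yf moves from -20 to -21: close to 1 for both,
# which is the shared linear growth for very negative margins.
step_hinge = hinge(-21.0) - hinge(-20.0)
step_deviance = binomial_deviance(-21.0) - binomial_deviance(-20.0)
print(step_hinge, step_deviance)
```

Both losses are decreasing in $yf$, so each rewards agreement in sign between $y$ and $f$; they differ mainly near $yf = 1$, where the hinge loss has a kink and reaches exactly zero.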

Penalized logistic regression is not the only model
that performs similarly to the SVM; replacing the
hinge loss with any sensible loss function will give
a similar result, for example, the exponential loss
function of boosting (Freund and Schapire, 1997)
and the squared error loss (Zhang and Oles, 2001;
Bühlmann and Yu, 2003). These loss functions are
all Bayes consistent. The binomial deviance and the
exponential loss are margin-maximizing loss functions,
but the squared error loss is not. The distance
weighted discrimination (Marron and Todd, 2002) is
designed specifically not to maximize the margin
and works well with high-dimensional data, which
in a way also supports the view that margin maximization is
not the key to the success of the SVM.